Lecture 1.1: Organising Your Work

Week 1 - Data Science Workflows

Dr Zak Varty

Week 1: What are we trying to do?

  • Data science is a collaborative discipline

  • Be a good collaborator, to others and to your future self

  • This week will show one framework to help you with that task

  • Like flossing: not difficult, but it requires discipline

  • We will take an opinionated, R-focused approach, but the ideas transfer to other settings.

One Project = One Directory


  • Sounds easy, but in practice it is not.


  • Requires prospective project scoping.


  • Entropy is ever increasing.


Properties of a well organised project


Ideally, we would like to organise our projects so that they are:


  • Portable
  • Version control friendly
  • Reproducible
  • Integrated Development Environment friendly

Portability

Is your work all in one place or scattered?

Can it be moved to a new location without breaking?

What do we mean by a new location, exactly?
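
One way to make the "can it be moved?" question concrete is to build every file path relative to the project root instead of hard-coding an absolute path. The sketch below uses Python's pathlib purely for illustration (in R, the here package plays a similar role). The directory names, the file name and the helper functions are all hypothetical.

```python
from pathlib import Path
import shutil
import tempfile


def data_path(project_root: Path, *parts: str) -> Path:
    """Build a path inside the project, starting from its root directory."""
    return project_root.joinpath(*parts)


def read_raw(project_root: Path) -> str:
    # Hypothetical layout: <project>/data/raw.csv
    return data_path(project_root, "data", "raw.csv").read_text()


def demo() -> bool:
    """Create a tiny project, move it, and check the same code still reads it."""
    with tempfile.TemporaryDirectory() as tmp:
        old_home = Path(tmp) / "old_location" / "my-project"
        (old_home / "data").mkdir(parents=True)
        data_path(old_home, "data", "raw.csv").write_text("x,y\n1,2\n")
        before = read_raw(old_home)

        # Move the whole project directory to a new location.
        new_home = Path(tmp) / "new_location" / "my-project"
        new_home.parent.mkdir(parents=True)
        shutil.move(str(old_home), str(new_home))

        # Nothing breaks, because no path inside the project was absolute.
        after = read_raw(new_home)
        return before == after
```

Calling demo() returns True when the moved project is read back unchanged. A path such as C:/Users/me/my-project/data/raw.csv written into the analysis code would fail the same check.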

Version Control Friendly


Reproducible



A study is reproducible if you can take the original data and the computer code used to analyze the data and recreate all of the numerical findings from the study.

Broman et al. (2017) “Recommendations to Funding Agencies for Supporting Reproducible Research”
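
The definition above implies that rerunning the same code on the same data should give the same numbers. When an analysis involves randomness, that usually means fixing the random seed. Here is a minimal sketch using Python's standard library for illustration. The data values, the seed and the bootstrap_mean function are made up for this example.

```python
import random
import statistics


def bootstrap_mean(data, n_resamples=200, seed=1234):
    """Bootstrap estimate of the mean; the fixed seed makes it repeatable."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=len(data))  # sample with replacement
        means.append(statistics.fmean(resample))
    return statistics.fmean(means)


data = [2.1, 3.4, 1.9, 5.6, 4.2]  # illustrative values only
first_run = bootstrap_mean(data)
second_run = bootstrap_mean(data)
```

Because both runs start from the same seed and the same data, first_run and second_run are identical. Without a fixed seed, two runs would generally give slightly different results.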

Integrated Development Environment Friendly

  • Possible to code and manage projects entirely in notepad or at the command line.

  • Puts a lot of strain on you, both your fingers and your brain

  • Integrated development environments such as RStudio, PyCharm and Visual Studio aim to reduce this burden

    • Autocomplete / shortcodes
    • Environment panes
    • Templating

Inside the README

  • Name
  • Project Status
  • Description
  • Installation
  • Usage / Examples
  • Support
  • Contributing
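
As a rough illustration, the headings listed above could be written out as a skeleton README for a new project. The sketch is in Python and assumes a Markdown README. The helper name and the TODO placeholders are hypothetical; only the section titles come from the list.

```python
README_SECTIONS = [
    "Project Status",
    "Description",
    "Installation",
    "Usage / Examples",
    "Support",
    "Contributing",
]


def readme_skeleton(project_name: str) -> str:
    """Return README text: the project name as title, then one heading per section."""
    lines = [f"# {project_name}", ""]
    for section in README_SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)
```

Writing the result of readme_skeleton("my-project") to a README file at the top level of the project directory gives collaborators, and your future self, a fixed place to start.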

Summary


  • Introduced a standardised project structure;


  • Good starting point for most data science projects;


  • Exceptions are apps and packages.

References

Broman, Karl, Mine Cetinkaya-Rundel, Amy Nussbaum, Christopher Paciorek, Roger Peng, Daniel Turek, and Hadley Wickham. 2017. “Recommendations to Funding Agencies for Supporting Reproducible Research.” In American Statistical Association, 2:1–4.